Skip to content

fix: Eliminate flaky parallel test failures caused by port collisions and timing races#558

Merged
hoffmaen merged 2 commits intocloudfoundry:developfrom
sap-contributions:fix-flaky-tests-port-collisions
Apr 27, 2026
Merged

fix: Eliminate flaky parallel test failures caused by port collisions and timing races#558
hoffmaen merged 2 commits intocloudfoundry:developfrom
sap-contributions:fix-flaky-tests-port-collisions

Conversation

@hoffmaen
Copy link
Copy Markdown
Contributor

@hoffmaen hoffmaen commented Apr 21, 2026

Summary

Eliminates a class of flaky parallel test failures in the gorouter test suite.

Port allocation overhaul (test_util/ports.go)

  • Replaced net.Listen(":0") (kernel-assigned ephemeral port, immediately closed) with a dedicated per-Ginkgo-process port range in [61000, 65534] — above the Linux default ephemeral range (32768–60999) so the OS never steals ports between allocation and listen().
  • Added ReservePort() / ReleasePort() / ReleaseAllPorts() to hold a listener open until an external process (gorouter binary, NATS server) is ready to bind, eliminating the TOCTOU race.
  • Rule: NextAvailPort() for in-process bindings; ReservePort() + ReleasePort()/ReleaseAllPorts() for external-process bindings.

Applied across all affected test suites

  • integration/ — all ports for gorouter and NATS binaries now use ReservePort(); ReleaseAllPorts() called just before every exec.Command(gorouterPath, ...) launch, including raw launch sites that bypassed the helper functions.
  • router/router_test.go, router_drain_test.go — NATS port uses ReservePort(); in-process router ports remain NextAvailPort().
  • mbus/subscriber_test.go, common/component_test.go — NATS port uses ReservePort().
  • Correct NewNATSRunnerReleasePortStart ordering enforced in all 6 call sites.

test/common/app.go

  • Listen() and TlsListen() now call net.Listen / tls.Listen eagerly before handing off to http.Server.Serve(), so the port is bound immediately (eliminates a secondary TOCTOU window for test app backends).

Flaky timing assertions fixed (4 test files)

  • proxy/round_tripper/proxy_round_tripper_test.gotime.Now() was captured after RoundTrip() + channel receive; deadline comparison used ±11 ms / ±6 ms tolerance. Fixed by capturing the reference time before the call and widening tolerance to the full timeout value.
  • handlers/requestinfo_test.go, handlers/reporter_test.go, proxy/proxy_test.go — same pattern; replaced with precise before ≤ timestamp ≤ after brackets.

HTTP 100 Continue test fix (proxy/proxy_test.go)

  • test_util.NewResponse() always sets Connection: close, which is correct for final responses but wrong for 100 Continue — the proxy would close the client connection after forwarding the header, causing a rare unexpected EOF on the subsequent read. Replaced NewResponse(100) with a plain http.Response literal (no headers) in both affected backend handlers.

Other

  • logger/logger.go — added fmt.Fprintf(os.Stderr, ...) before os.Exit(1) in Fatal() so the message is guaranteed to reach stderr even if the slog handler's buffer hasn't flushed.
  • Fixed hardcoded ports 6705/6706 in gdpr_test.go with ReservePort().
  • Stop gorouter before NATS in all AfterEach/StopAndCleanup blocks to prevent the subscriber's ClosedCB from calling log.Fatalos.Exit(1) and killing the test binary before Ginkgo can capture the result.

Backward Compatibility

Breaking Change? No

Note on AI usage

Parts of this code and tests were developed with assistance from Claude Code (claude-opus-4-20250514).

@hoffmaen hoffmaen force-pushed the fix-flaky-tests-port-collisions branch from b5fc538 to ae2d781 Compare April 21, 2026 19:44
@hoffmaen hoffmaen changed the title fix: Use ReservePort pattern to eliminate port collisions in tests fix: Fix flaky parallel test failures caused by port collisions and timing races Apr 22, 2026
@hoffmaen hoffmaen changed the title fix: Fix flaky parallel test failures caused by port collisions and timing races fix: Eliminate flaky parallel test failures caused by port collisions and timing races Apr 22, 2026
@hoffmaen hoffmaen force-pushed the fix-flaky-tests-port-collisions branch from 34c0c21 to 58acbe5 Compare April 22, 2026 07:55
@hoffmaen hoffmaen marked this pull request as ready for review April 22, 2026 08:31
@hoffmaen hoffmaen requested a review from a team as a code owner April 22, 2026 08:31
@hoffmaen hoffmaen marked this pull request as draft April 22, 2026 09:05
@hoffmaen hoffmaen marked this pull request as ready for review April 22, 2026 09:21
b1tamara
b1tamara previously approved these changes Apr 27, 2026
Copy link
Copy Markdown
Contributor

@b1tamara b1tamara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-project-automation github-project-automation Bot moved this from Inbox to Pending Merge | Prioritized in Application Runtime Platform Working Group Apr 27, 2026
- Assign each Ginkgo parallel process a dedicated port range in [61000,65534]
  (above the Linux ephemeral range) to prevent cross-process port collisions
- Add ReservePort/ReleasePort/ReleaseAllPorts to hold ports open until external
  processes (gorouter, NATS) bind them, eliminating the TOCTOU race
- Apply correct reserve→release→start ordering across all 6 NATS call sites
  and all gorouter exec.Command launch sites
- Fix flaky timing assertions in 4 test files by bracketing with before/after
  timestamps instead of comparing to time.Now() after the fact
- Fix HTTP 100 Continue test: replace NewResponse(100) with a header-less
  http.Response literal to avoid spurious Connection: close on the proxy
- Stop gorouter before NATS in AfterEach to prevent log.Fatal → os.Exit(1)
  killing the test binary before Ginkgo captures the result

fix: Fix additional findings
@hoffmaen hoffmaen merged commit 297cdc4 into cloudfoundry:develop Apr 27, 2026
8 checks passed
@github-project-automation github-project-automation Bot moved this from Pending Merge | Prioritized to Done in Application Runtime Platform Working Group Apr 27, 2026
@hoffmaen hoffmaen deleted the fix-flaky-tests-port-collisions branch April 27, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

2 participants